Merge Gemma recipe with full finetune #668
Merged
Context
The primary reason Gemma had its own recipe was weight tying, where the output projection shares the token embedding weights. This replicates the behavior of `ReversibleEmbedding` in Keras, where the embedding weight can be reused to project back from the output dim to the input dim. This also had implications for FSDP wrapping and initializing on the meta device; see #630 and #616 for more discussion. We can actually achieve the same "weight tying" by getting rid of the output projection altogether and using the embedding weight directly for the output (shout-out @pbontrager):
output = F.linear(h, self.tok_embeddings.weight).float()
This is more akin to how it's done in `GemmaCausalLM` in Keras, where there is no output projection and the token embedding weight is used directly.
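For illustration, here is a minimal sketch of the idea in a toy decoder (the class and layer names here are illustrative only, not the actual torchtune module):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TiedOutputDecoder(nn.Module):
    """Toy decoder showing output "weight tying" without a separate projection."""

    def __init__(self, vocab_size: int, embed_dim: int) -> None:
        super().__init__()
        self.tok_embeddings = nn.Embedding(vocab_size, embed_dim)
        # ... attention / MLP layers omitted for brevity ...

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.tok_embeddings(tokens)  # [batch, seq, embed_dim]
        # ... h would pass through the transformer layers here ...
        # Reuse the embedding matrix as the output projection instead of a
        # dedicated nn.Linear(embed_dim, vocab_size, bias=False) parameter.
        return F.linear(h, self.tok_embeddings.weight).float()  # [batch, seq, vocab_size]
```

Because there is only a single shared parameter, FSDP wrapping and meta-device initialization no longer need special handling for a tied output weight.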
Changelog
- `GemmaTransformerDecoder`: removed the output projection; the token embedding weight is used directly for the output
- Removed the dedicated `gemma_full_finetune_distributed.py` recipe in favor of the shared `full_finetune_distributed` recipe
- Removed `load_shared_weights_utils` and `save_shared_weights_utils`
- Corresponding updates under `torchtune/models/`
Test plan
This run had nearly equivalent loss values to the gemma recipe on main:
tune run --nnodes 1 --nproc_per_node 4 full_finetune_distributed --config gemma/2B_full max_steps_per_epoch=5
tune run --nnodes 1 --nproc_per_node 4 gemma_full_finetune_distributed --config gemma/2B_full max_steps_per_epoch=5
Comparison with HF implementation:
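For reference, a logit parity check along these lines could be used (a sketch only; the `gemma_2b` builder import and the Hugging Face model loading shown here are assumptions for illustration, and checkpoint loading is elided):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torchtune.models.gemma import gemma_2b

# Assumed setup: both models initialized from the same converted Gemma 2B weights.
hf_model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")

tune_model = gemma_2b()
# tune_model.load_state_dict(...)  # load the converted checkpoint here

inputs = tokenizer("The capital of France is", return_tensors="pt")

with torch.no_grad():
    hf_logits = hf_model(**inputs).logits
    tune_logits = tune_model(inputs["input_ids"])

# With matching weights the two sets of logits should agree up to numerical noise.
print(torch.max(torch.abs(hf_logits - tune_logits)))
```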